In [1]:
# Auto reload function (if modified outside notebook)
%load_ext autoreload
%autoreload 2
# Plot precision
%config InlineBackend.figure_format = 'retina'
# Interactive cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Add css tags in Notebook
from IPython.display import display_html 

import pandas as pd           # Handle dataset
import pydot                  # Export decision tree (image...)
from scipy import stats       # Stats modules
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc


from models_comparaison import *   # Project helper functions (see models_comparaison.py)

feature_names = ['percentile1', 'percentile2', 'percentile3', 'percentile4', 'percentile5',
                 'persistence1', 'persistence2', 'persistence3', 'persistence4', 'persistence5',
                 'tree1', 'tree2', 'tree3']

Report: cloud detection - method analysis

Alexandre Bort

28/06/2019

Note:

This notebook does not focus on the code itself, but on the outputs collected from the different processing steps. For more code details, please refer to the decision_tree.py file; for the whole project, see this GitHub repo.

Plan

 I. Context
 II. Current situation
 III. Methods
      III.A. Random Forest
            1. Number of estimators
            2. Tree overview
            3. Feature importance
      III.B. Other methods
            1. Decision tree
            2. Logistic regression
 IV. With the 5 most important features
      IV.1. Feature importance
      IV.2. Cutoff
      IV.3. Precision - recall
 V. Results

I. Context

  • Initial goal: label each pixel as cloud or not cloud.
  • How: 13 different methods (10 based on the difference between the image and a cloud-free reference image, plus 3 ready-to-use decision trees).
  • Dataset: a public labelled dataset (repo) indicating whether each pixel is a cloud or not.
  • Goal of the report: build a robust cloud detection model on top of our set of methods.

II. Current situation

  • Methods processing: the main bugs in our methods have been fixed
    • Non-square images:
      • not all images have the same shape (it depends on the satellite path)
      • solved by adding a common-area criterion
    • Duplicated images:
      • some images were sampled on the same date but provided on different dates; as a result, there are duplicate images with different ids
      • solved by filtering on the sampling date and the distribution date (a small sketch follows this list)
    • Fully masked bands:
      • some extra bands are always empty (QA10 - QA20)
      • solved by ignoring those problematic bands

  • After processing all those images, the methods still fail on some pixels. That leads us to remove:

    • 3036 of the 200 000 input pixels (1.5%) from the training set
    • 924 of the 60 000 input pixels (1.5%) from the evaluation set

    => The final training (respectively evaluation) dataset is composed of 196 964 (resp. 59 076) samples (pixels).
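
The duplicate filtering can be illustrated with a small sketch. This is a toy example with hypothetical column names, not the project's actual schema:

import pandas as pd

# Toy catalogue: image "b" is a re-distribution of image "a" (same sensing date)
images = pd.DataFrame({
    "id":                ["a", "b", "c"],
    "sensing_date":      ["2015-12-03", "2015-12-03", "2015-12-10"],
    "distribution_date": ["2015-12-04", "2016-01-15", "2015-12-11"],
})

# Keep the first distributed copy of each sensing date, drop later duplicates
deduplicated = (images.sort_values("distribution_date")
                      .drop_duplicates(subset="sensing_date", keep="first"))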

The dataset looks like this:

In [2]:
# Load training - evaluation dataset
training = pd.read_excel("Data/results.xlsx", sheet_name="Training")
evaluation = pd.read_excel("Data/results.xlsx", sheet_name="Evaluation")

# Creating feature - output dataset
x_train, y_train = split_feature_output(training, feature_names, show_time=False)
x_test, y_test = split_feature_output(evaluation, feature_names, show_time=False)

# Create whole dataset
data = training.append(evaluation)
data.head()
Out[2]:
cloud id_GEE index latitude longitude percentile1 percentile2 percentile3 percentile4 percentile5 persistence1 persistence2 persistence3 persistence4 persistence5 tree1 tree2 tree3
0 1 COPERNICUS/S2/20151203T105422_20151203T105811_... 0 38.065710 -0.786016 0 0 0 0 0 0 1 0 1 0 0 0 0
1 0 COPERNICUS/S2/20151203T105422_20151203T105811_... 1 38.349942 -1.563068 0 0 0 0 0 0 1 0 1 0 0 0 0
2 0 COPERNICUS/S2/20151203T105422_20151203T105811_... 2 37.994208 -1.775253 0 0 0 0 0 0 1 0 1 0 0 0 0
3 0 COPERNICUS/S2/20151203T105422_20151203T105811_... 3 38.725824 -0.892539 0 0 0 0 0 0 1 0 1 0 0 0 0
4 0 COPERNICUS/S2/20151203T105422_20151203T105811_... 4 38.025237 -0.878142 0 0 0 0 0 0 1 0 1 0 0 0 0

Let's look at the distribution of the values in the different columns:

In [3]:
# Plot the distribution of cloud / not cloud pixels per method
plot_methods_repartition(data, feature_names)

Correlation

  • The correlation matrix doesn't show high correlations between the methods; only the 3 tree methods are strongly correlated with each other.
  • If all methods were perfect, we would expect a complete correlation. Since this is not the case, we can build a model combining all those methods. The 3 tree methods are the columns with the highest correlation to the cloud column, meaning they are accurate methods.

Note: a full correlation wouldn't mean the methods are accurate (they could all predict the same wrong output).
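
As a toy illustration of this note (made-up data, not from the project's dataset): two methods that always agree have a perfect mutual correlation, yet both can be wrong half the time.

import pandas as pd

toy = pd.DataFrame({"cloud":    [1, 1, 0, 0],
                    "method_a": [1, 0, 1, 0],
                    "method_b": [1, 0, 1, 0]})
print(toy.corr())                           # method_a vs method_b: correlation 1.0
print((toy.cloud == toy.method_a).mean())   # ...but accuracy is only 0.5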

In [4]:
data[["cloud"] + feature_names].corr().style.background_gradient(cmap='coolwarm').set_precision(2)
Out[4]:
cloud percentile1 percentile2 percentile3 percentile4 percentile5 persistence1 persistence2 persistence3 persistence4 persistence5 tree1 tree2 tree3
cloud 1 0.55 0.49 0.27 0.49 0.6 0.38 0.41 0.32 0.42 0.49 0.71 0.87 0.89
percentile1 0.55 1 0.78 0.56 0.77 0.69 0.59 0.57 0.6 0.61 0.62 0.58 0.53 0.54
percentile2 0.49 0.78 1 0.67 0.84 0.61 0.5 0.61 0.66 0.62 0.59 0.61 0.53 0.51
percentile3 0.27 0.56 0.67 1 0.68 0.43 0.39 0.48 0.69 0.49 0.45 0.39 0.32 0.28
percentile4 0.49 0.77 0.84 0.68 1 0.59 0.51 0.58 0.68 0.65 0.63 0.58 0.53 0.5
percentile5 0.6 0.69 0.61 0.43 0.59 1 0.56 0.51 0.47 0.49 0.63 0.52 0.55 0.59
persistence1 0.38 0.59 0.5 0.39 0.51 0.56 1 0.61 0.45 0.54 0.49 0.38 0.36 0.36
persistence2 0.41 0.57 0.61 0.48 0.58 0.51 0.61 1 0.51 0.66 0.53 0.47 0.42 0.41
persistence3 0.32 0.6 0.66 0.69 0.68 0.47 0.45 0.51 1 0.53 0.47 0.45 0.35 0.33
persistence4 0.42 0.61 0.62 0.49 0.65 0.49 0.54 0.66 0.53 1 0.52 0.44 0.44 0.43
persistence5 0.49 0.62 0.59 0.45 0.63 0.63 0.49 0.53 0.47 0.52 1 0.5 0.5 0.51
tree1 0.71 0.58 0.61 0.39 0.58 0.52 0.38 0.47 0.45 0.44 0.5 1 0.73 0.76
tree2 0.87 0.53 0.53 0.32 0.53 0.55 0.36 0.42 0.35 0.44 0.5 0.73 1 0.89
tree3 0.89 0.54 0.51 0.28 0.5 0.59 0.36 0.41 0.33 0.43 0.51 0.76 0.89 1

III. Methods

III.A. Random Forest

III.A.1. Number of estimators

I used the RandomForestClassifier class from the sklearn package (doc). Many options are available. An important one is the number of trees used to build the random forest model: n_estimators. Generally, the more estimators, the better the results; however, using many estimators is resource-consuming.

The following figure has 2 plots (a sketch of the underlying helper follows the list):

  • Plot 1: accuracy over the number of estimators
  • Plot 2: processing time over the number of estimators
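
For reference, here is a minimal sketch of what compute_accuracy_over_nb_estimators plausibly does. The actual implementation, with its plots and progress bar, lives in models_comparaison.py; the grid of candidate values is an assumption:

import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def accuracy_vs_estimators(x_train, x_test, y_train, y_test, grid=range(2, 501, 2)):
    """Fit one forest per candidate n_estimators, recording accuracy and timings."""
    rows = []
    for n in grid:
        model = RandomForestClassifier(n_estimators=n, random_state=0)
        t0 = time.time()
        model.fit(x_train, y_train)
        t_fit = time.time() - t0
        t0 = time.time()
        y_pred = model.predict(x_test)
        t_pred = time.time() - t0
        rows.append({"nb_estimators_list": n,
                     "accuracies": accuracy_score(y_test, y_pred),
                     "times_fitting": t_fit,
                     "times_predict": t_pred})
    return pd.DataFrame(rows)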
In [5]:
res_rand_forest = compute_accuracy_over_nb_estimators(x_train, x_test, y_train, y_test, show_plot=True, progress_bar=False)

Strangely enough, the best accuracy is obtained for a small number of estimators. However, these variations are very small (they only appear from the third decimal place onwards). The early fluctuations can be understood as noise, which is reduced as the number of estimators increases.

In [6]:
print("20 best accuracies (sorted):\n", res_rand_forest.sort_values(by="accuracies", ascending=False).head(20))
20 best accuracies (sorted):
     nb_estimators_list  accuracies  times_fitting  times_predict
13                  18    0.965248       0.967473       0.100940
12                  16    0.965214       0.860514       0.089951
17                  26    0.965197       1.382246       0.142922
11                  14    0.965147       0.756571       0.078970
15                  22    0.965130       1.170344       0.121933
9                   10    0.965113       0.583683       0.058968
10                  12    0.965096       0.655627       0.068962
72                 490    0.965079      26.511489       2.681862
14                  20    0.965079       1.069399       0.110952
42                 190    0.965079       9.929966       1.023436
61                 380    0.965079      24.102888       2.343710
62                 390    0.965079      24.775353       2.506619
19                  30    0.965079       1.577138       0.164915
16                  24    0.965079       1.268308       0.132924
18                  28    0.965079       1.482184       0.153922
67                 440    0.965079      24.548120       2.359571
71                 480    0.965079      25.990214       2.567628
70                 470    0.965079      24.738462       2.553551
66                 430    0.965079      24.910846       2.316296
65                 420    0.965062      24.920879       2.393576

In the rest of the report, we will use n_estimators=500. Here is the model:

In [7]:
## Create Random forest model
# Next results assume n_estimators = 500
nb_estimators = 500

# Create result dataframe + compute feature accuracy
res_accuracy = pd.DataFrame({"Methods": feature_names,
                             "Accuracy": [accuracy_score(data.cloud, data[col]) for col in feature_names]})

# Create random forest classifier object
model_rf = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
# Fit model
model_rf.fit(x_train, y_train)
Out[7]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

III.A.2. Tree overview

The depths of the trees in the random forest model (for 500 estimators) are the following:

In [8]:
# Predict the response for test dataset
y_pred = model_rf.predict(x_test)

# Add the accuracy to the accuracy results
add_row(res_accuracy, ["randomForest", accuracy_score(y_test, y_pred)])

# Save the first tree of the forest as a png image
export_graphviz(model_rf.estimators_[0], out_file='tree_rf.dot',
                feature_names=list(training.columns[5:]),
                class_names=["Not Cloudy", "Cloudy"],
                rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('tree_rf.dot')
graph.write_png('tree_rf.png')


lengths = np.array([estimator.tree_.max_depth for estimator in model_rf.estimators_])

print("Length statistics of the trees in the randomForest model (n_estimatros=%d):" % (nb_estimators))
print("\t- Min: %3d\n\t- Max: %3d\n\t- Mean: %3.2f" % (min(lengths), max(lengths), sum(lengths)/len(lengths)))

nodes = np.array([estimator.tree_.node_count for estimator in model_rf.estimators_])

print("\nNodes count statistics of the trees in the randomForest model (n_estimatros=%d):" % (nb_estimators))
print("\t- Min: %3d\n\t- Max: %3d\n\t- Mean: %3.2f" % (min(nodes), max(nodes), sum(nodes)/len(nodes)))
Out[8]:
Methods Accuracy
0 randomForest 0.965079
1 percentile1 0.775145
2 percentile2 0.740978
3 percentile3 0.617571
4 percentile4 0.737553
5 percentile5 0.790466
6 persistence1 0.685592
7 persistence2 0.705503
8 persistence3 0.645212
9 persistence4 0.708464
10 persistence5 0.741443
11 tree1 0.838377
12 tree2 0.932409
13 tree3 0.943470
Depth statistics of the trees in the randomForest model (n_estimators=500):
	- Min:  13
	- Max:  13
	- Mean: 13.00

Node count statistics of the trees in the randomForest model (n_estimators=500):
	- Min: 1493
	- Max: 1727
	- Mean: 1589.74

The first tree of the random forest model looks like this (image: tree_rf.png).

III.A.3. Feature importance

The following graph shows the importance of each feature. The 3 tree methods are the most important, which is consistent with the correlation matrix seen in section II.

In [9]:
tmp = pd.DataFrame({"Methods": feature_names, 
              "Importances": model_rf.feature_importances_}).sort_values(by="Importances", ascending=False)
g = sns.barplot(x='Methods', y='Importances', data=tmp)
_ = g.set_xticklabels(g.get_xticklabels(), rotation=45)
_ = plt.title("Importance per feature (Random forest model)")
plt.show()
[Bar plot: feature importances in decreasing order: tree3, tree2, tree1, percentile5, percentile1, percentile2, persistence5, percentile4, persistence2, persistence4, persistence1, persistence3, percentile3]

The following graph shows how the accuracy evolves as the number of features used in the random forest decreases. At each iteration, the feature with the smallest importance is removed (a sketch of the procedure is shown below).
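
Here is a minimal sketch of the idea behind accuracy_over_number_of_feature (the actual implementation lives in models_comparaison.py; function and variable names here are illustrative):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def accuracy_vs_feature_count(training, evaluation, features, nb_estimators=500):
    """Drop the least important feature at each step and track test accuracy."""
    features = list(features)
    rows = []
    while features:
        model = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
        model.fit(training[features], training.cloud)
        acc = accuracy_score(evaluation.cloud, model.predict(evaluation[features]))
        rows.append({"Number_of_features": len(features), "Accuracy": acc})
        # Remove the feature with the smallest importance before the next round
        weakest = features[int(np.argmin(model.feature_importances_))]
        features.remove(weakest)
    return pd.DataFrame(rows)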

In [10]:
accuracy_over_nb_features = accuracy_over_number_of_feature(training, evaluation, feature_names, nb_estimators)
accuracy_over_nb_features \
            .plot('Number_of_features', 'Accuracy', rot=0, title="Accuracy over the number of features (Random forest model)") \
            .invert_xaxis()

III.B. Other methods

I also tried to build two other models: a decision tree model and a logistic regression model.

III.B.1. Decision tree

I used the DecisionTreeClassifier class from the sklearn package to build the decision tree model. The tree is similar to the ones built in the random forest model, and is also deep:

In [11]:
# Create decision tree classifier object
tree = DecisionTreeClassifier()
# Train decision tree classifier
tree = tree.fit(x_train, y_train)
# Predict the response for test dataset
y_pred = tree.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
add_row(res_accuracy, ["decisionTree", accuracy])

# Save the tree as a png image
export_graphviz(tree, out_file='decision_tree.dot',
                feature_names=list(training.columns[5:]),
                class_names=["Not Cloudy", "Cloudy"],
                rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('decision_tree.dot')
graph.write_png('decision_tree.png')

print("Tree depth: ", tree.tree_.max_depth)
print("Number of nodes: ", tree.tree_.node_count)
Out[11]:
Methods Accuracy
0 decisionTree 0.964943
1 randomForest 0.965079
2 percentile1 0.775145
3 percentile2 0.740978
4 percentile3 0.617571
5 percentile4 0.737553
6 percentile5 0.790466
7 persistence1 0.685592
8 persistence2 0.705503
9 persistence3 0.645212
10 persistence4 0.708464
11 persistence5 0.741443
12 tree1 0.838377
13 tree2 0.932409
14 tree3 0.943470
Tree depth:  13
Number of nodes:  1649

The tree is similar to the ones from the random forest model (image: decision_tree.png).

In [12]:
tmp = dict(zip(feature_names, tree.feature_importances_))
tmp = {key: [value] for key, value in tmp.items()}
tmp = pd.DataFrame(tmp).T
tmp.columns = ["feature_importances"]
tmp = tmp.sort_values(by="feature_importances", ascending=False)
g = sns.barplot(x=tmp.index, y="feature_importances", data=tmp)
_ = g.set_xticklabels(g.get_xticklabels(), rotation=45)
_ = plt.title("Feature importance (decision tree model)")
plt.show()
[Bar plot: feature importances in decreasing order: tree3, tree2, percentile5, percentile3, persistence4, persistence3, persistence2, persistence1, persistence5, percentile1, percentile2, percentile4, tree1]

III.B.2. Logistic regression

I used the LogisticRegression class from the sklearn package. The model outputs a probability between 0 and 1; I then classify each output by comparing it to 0.5.

In [13]:
from sklearn.linear_model import LogisticRegression
# Create model
model = LogisticRegression(solver='liblinear')
# Fit model
model.fit(x_train, y_train)
# Predict class probabilities for the test dataset
y_score = model.predict_proba(x_test)[:, 1]
# Categorize the probabilities with a 0.5 threshold
y_pred = np.where(y_score > 0.5, 1, 0)

accuracy = accuracy_score(y_test, y_pred)
add_row(res_accuracy, ["logisticRegression", accuracy])
pass
Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Out[13]:
Methods Accuracy
0 logisticRegression 0.951926
1 decisionTree 0.964943
2 randomForest 0.965079
3 percentile1 0.775145
4 percentile2 0.740978
5 percentile3 0.617571
6 percentile4 0.737553
7 percentile5 0.790466
8 persistence1 0.685592
9 persistence2 0.705503
10 persistence3 0.645212
11 persistence4 0.708464
12 persistence5 0.741443
13 tree1 0.838377
14 tree2 0.932409
15 tree3 0.943470

IV. With the 5 most important features

In the next sections, we only deal with the 5 most important features found in section III.A.3: "tree3", "tree2", "tree1", "percentile5", "percentile1".

IV.1. Feature importance

Correlation between those 5 features

  • Criterion: between tree1 and tree2, remove the one with the highest correlation to the percentile5 and percentile1 features.
  • Result: tree1, with a correlation of 0.58 (to percentile1).
In [14]:
feature_names = ["tree3", "tree2", "tree1", "percentile5", "percentile1"]

data[feature_names].corr().style.background_gradient(cmap='coolwarm').set_precision(2)
Out[14]:
tree3 tree2 tree1 percentile5 percentile1
tree3 1 0.89 0.76 0.59 0.54
tree2 0.89 1 0.73 0.55 0.53
tree1 0.76 0.73 1 0.52 0.58
percentile5 0.59 0.55 0.52 1 0.69
percentile1 0.54 0.53 0.58 0.69 1

Feature importances depending on which feature is removed

The following plot shows the importance per feature, depending on which column is removed (a sketch of the helper used here follows the list). Several points stand out:

  • The percentile1 method has very little influence in all cases (always less than 5%).
  • The tree3 and tree2 methods always have the highest importance, well above the other features (always at least twice as much).
  • The ranking is always the same, whichever feature is removed.
  • Removing the tree1 feature, according to the criterion defined just above, only reduces the accuracy by 0.03%.
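
A minimal sketch of the idea behind feature_importance_per_model (the real implementation is in models_comparaison.py; names here are illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def importances_per_removed_feature(training, features, nb_estimators=500):
    """For each removed feature, fit a forest without it and record the importances of the rest."""
    results = {}
    for removed in [None] + list(features):
        kept = [f for f in features if f != removed]
        model = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
        model.fit(training[kept], training.cloud)
        results["none" if removed is None else removed] = dict(zip(kept, model.feature_importances_))
    # One column per removed feature, NaN where a feature was itself removed
    return pd.DataFrame(results)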
In [15]:
feat_importance = feature_importance_per_model(training, evaluation, feature_names, nb_estimators=500)

plot_feature_importance_per_model(feat_importance, feature_names)

In the following sections, we only consider the features tree3, tree2, percentile5 and percentile1.

IV.2. Cutoff

Another important parameter in classification is the cutoff threshold. It makes it possible to tune the predictions (trading false positives against false negatives...).

In our study, we are trying to predict whether a pixel is a cloud or not. We want to detect cloudy pixels, remove them, and then interpolate them. Because cloudy pixels are repaired in the next processing step, predicting a pixel as non-cloud when it is actually cloudy does more harm than repairing a non-cloudy pixel classified as cloudy.

In the following section, we therefore look for a cutoff value that reduces the rate of false negatives (cloudy pixels predicted as not cloud). A possible way to derive such a threshold is sketched below.
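
One possible way to derive such a cutoff from the ROC curve data. This is an illustrative heuristic with an arbitrary target rate; the 0.2972 value used below was read off the interactive plot:

import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_tpr(y_true, y_score, min_tpr=0.97):
    """Highest threshold that still catches at least min_tpr of the cloudy pixels."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # thresholds are sorted in decreasing order, so tpr is non-decreasing;
    # the first index reaching the target gives the highest acceptable threshold
    first_ok = np.where(tpr >= min_tpr)[0][0]
    return thresholds[first_ok]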

The confusion matrices below have the following structure: rows are the predictions, columns are the actual classes.

                        Cloud (actual)    Not cloud (actual)
Cloud (predicted)       True positive     False positive
Not cloud (predicted)   False negative    True negative
In [16]:
feature_names = ["tree3", "tree2", "percentile5", "percentile1"]
nb_estimators = 500

# Subset the datasets
sub_training = training[["cloud"] + feature_names]
sub_evaluation = evaluation[["cloud"] + feature_names]
# Creating feature - output dataset
x_train, y_train = split_feature_output(sub_training, feature_names, show_time=False)
x_test, y_test = split_feature_output(sub_evaluation, feature_names, show_time=False)


# Create random forest classifier object
model_rf = RandomForestClassifier(n_estimators=nb_estimators, random_state=0)
# Fit model
_ = model_rf.fit(x_train, y_train)

# Eval the model with proba output
y_pred = model_rf.predict(x_test)
y_score = model_rf.predict_proba(x_test)

# Draw ROC Curve
fpr, tpr, auc_thresholds = roc_curve(y_test, y_score[:, 1])
plotly_roc_curve(fpr, tpr, auc_thresholds)

Choosing a threshold of 0.2972 reduces the number of false negatives (from 1928 to 1682). However, the number of false negatives remains higher than the number of false positives (1242).

In [17]:
threshold = 0.2972
names = ["Cloud", "Not cloud"]

y_pred_29 = np.where(y_score[:,1] > threshold, 1, 0)
cm_29 = np.transpose(confusion_matrix(y_test, y_pred_29, labels=[1,0]))
cm_29 = pd.DataFrame(cm_29, columns=names, index=names)
# sum((y_pred_29 == 1) & (y_test == 0))
# sum((y_pred_29 == 1) & (y_test == 1))
# sum((y_pred_29 == 0) & (y_test == 1))

cm_50 = np.transpose(confusion_matrix(y_test, y_pred, labels=[1,0]))
cm_50 = pd.DataFrame(cm_50, columns=names, index=names)
# sum((y_pred == 1) & (y_test == 0))
# sum((y_pred == 1) & (y_test == 1))
# sum((y_pred == 0) & (y_test == 1))

accuracy = accuracy_score(y_test, y_pred_29)
res_accuracy = add_row(res_accuracy, ["randomForest V2", accuracy])

cm_50_styler = cm_50.style.set_table_attributes("style='display:inline; padding:10px;'").set_caption(
                            '<center>Confusion matrix <br>(threshold {:6.4f}):<center>'.format(0.5))
cm_29_styler = cm_29.style.set_table_attributes("style='display:inline; padding:10px;'").set_caption(
                            '<center>Confusion matrix <br>(threshold {:6.4f}):<center>'.format(threshold))

display_html(cm_50_styler._repr_html_() + cm_29_styler._repr_html_(), raw=True)

print("Accuracy {:6.4f}: ".format(0.5), accuracy_score(y_test, y_pred))
print("Accuracy {:6.4f}: ".format(threshold), accuracy_score(y_test, y_pred_29))
Confusion matrix
(threshold 0.5000):
Cloud Not cloud
Cloud 27683 863
Not cloud 1928 28602
Confusion matrix
(threshold 0.2972):
Cloud Not cloud
Cloud 27929 1242
Not cloud 1682 28223
Accuracy (threshold 0.5000):  0.9527557722256077
Accuracy (threshold 0.2972):  0.9505044349651297

IV.3. Precision - recall

The precision and recall for a threshold of 0.2972 look good (green circle), as the following plot shows.

In [18]:
precision, recall, thresholds = precision_recall_curve(y_test, y_score[:, 1])

plotly_precision_recall_vs_threshold(precision, recall, thresholds, 0.2972)

V. Results

The accuracies of the different models and methods:

In [19]:
plot_accuracy_over_methods(res_accuracy)
res_accuracy
Out[19]:
Methods Accuracy
0 randomForest V2 0.950504
1 logisticRegression 0.951926
2 decisionTree 0.964943
3 randomForest 0.965079
4 percentile1 0.775145
5 percentile2 0.740978
6 percentile3 0.617571
7 percentile4 0.737553
8 percentile5 0.790466
9 persistence1 0.685592
10 persistence2 0.705503
11 persistence3 0.645212
12 persistence4 0.708464
13 persistence5 0.741443
14 tree1 0.838377
15 tree2 0.932409
16 tree3 0.943470